## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -194.00 8.76 12.30 16.26 18.30 406.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -194.00 8.76 12.30 16.41 18.30 338.90
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -75.00 8.30 11.80 15.22 18.36 406.00
These all look like exponential (strongly right-skewed) distributions. All plots have interesting peaks at just under $60 and again around $70. Both taxi types also have negative totals. How can a total be negative? Possibly if it is a refund. The median total amount for yellow taxis is higher than that for green taxis, at $12.30 vs. $11.80. Given the right skew, taking the log should give us something close to a normal distribution.
The log distributions for all taxis, yellow and green follow a normal distribution.
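As a minimal sketch of the log transformation described above, the following uses synthetic right-skewed values rather than the taxi data. The `skewness` helper is hypothetical and defined here only for illustration. Zero and negative amounts are dropped first, since the log is undefined for them.

```r
# Synthetic, right-skewed "amounts" for illustration only (not the taxi data).
set.seed(42)
amounts <- c(rlnorm(5000, meanlog = 2.5, sdlog = 0.6), -5, 0)

# The log is undefined for zero and negative values, so keep positives only.
positive <- amounts[amounts > 0]
log_amounts <- log(positive)

# Hypothetical helper: sample skewness (third standardized moment).
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

raw_skew <- skewness(positive)     # clearly positive for right-skewed data
log_skew <- skewness(log_amounts)  # close to zero if the log is roughly normal
```

A skewness near zero after the transformation is only a rough indication of normality; a histogram or a Q-Q plot would be the natural visual check.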
The total_amount paid, however, is based on many factors: fare_amount, extra, mta_tax, improvement_surcharge, tip_amount and tolls_amount. It therefore makes sense to look at the basic fare before other charges and tips, as this is determined by time and distance.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -87.00 6.50 10.00 13.18 15.50 356.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -87.00 6.50 10.00 13.25 15.00 280.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -75.0 6.5 10.0 12.7 16.0 356.0
Very interesting. The median for both yellow and green taxis, as well as the overall median is $10. This also shows that tips plus other charges are $1.80 for green taxis and $2.30 for yellow on average. Looking at the fare amount, the spike occurs around $50 for the yellow taxis, but no such spike exists for the green taxis. Does this mean that tips plus other charges are roughly $8-$10 for these $50 trips? We also have outliers up to $280 for yellow taxis and $356 for green.
Most of the fares are less than $60, with the bulk occurring in the region of $30 or less. This could signify that New Yorkers prefer using taxis for shorter trips.
The fare_amount still maintains the exponential distribution. Hence the removal of additional charges and tips had no effect on the shape, the $50 spike aside.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -107.00 1.30 2.30 3.08 3.65 386.00
75% of the additional charges are very small, only going up to $3.65. There is a spike around $6 corresponding to the difference seen between the total and fare amounts. In general, however, there are few additional charges beyond $5. This also indicates that New Yorkers might take shorter trips, and hence tip less and pay smaller or no tolls. There are outliers, however, as the maximum additional charge is $386.
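The additional charges summarized above can be thought of as whatever is paid on top of the metered fare. A minimal sketch, assuming they are computed as total_amount minus fare_amount and using made-up rows:

```r
# Hypothetical rows; column names follow the dataset described in the text.
trips <- data.frame(
  fare_amount  = c(10.0, 52.0, 6.5),
  total_amount = c(12.3, 58.3, 7.3)
)

# Everything paid on top of the metered fare: extras, taxes, tips and tolls.
trips$additional <- trips$total_amount - trips$fare_amount

summary(trips$additional)
```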
If New Yorkers take shorter trips, what do the distributions of distance and time look like?
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.060 1.800 3.053 3.400 300.000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.050 1.800 3.064 3.310 300.000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.100 2.000 2.974 3.800 37.440
Trip activity starts to decrease above 5 miles until there is very little activity at around 10 miles and above. This coincides with what is seen in the fare amount histograms, supporting the idea that New Yorkers prefer short trips. Interestingly, there is a spike at a distance of 0 miles, with about 450 trips for green taxis and just under 1,500 trips for yellow taxis. This could indicate that these taxis were stuck in traffic, or these may be errors.
## [1] 300.00 158.40 85.80 78.50 78.08 47.40 46.55 46.10 44.37 42.74
Remove the outliers. Some towns in Westchester County are up to 50 miles and 60 minutes from New York City, so distances of up to about 50 miles are treated as valid.
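A small sketch of this step, using the ten largest distances printed above. The 50 mile cutoff is an assumption taken from the Westchester reasoning, not a value confirmed by the original code.

```r
# The ten largest trip distances (miles) printed above.
top_distances <- c(300.00, 158.40, 85.80, 78.50, 78.08,
                   47.40, 46.55, 46.10, 44.37, 42.74)

# Assumed cutoff: trips of up to about 50 miles are treated as plausible.
max_plausible_miles <- 50

kept    <- top_distances[top_distances <= max_plausible_miles]
removed <- top_distances[top_distances >  max_plausible_miles]
```

With this cutoff the five largest values are dropped and 47.40 miles becomes the longest remaining trip.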
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.850 6.733 11.300 15.560 18.430 1440.000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.850 6.783 11.380 15.620 18.530 1440.000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 6.317 10.670 15.110 17.670 1437.000
These are right-skewed distributions, with 75% of the trips lasting under about 18.5 minutes. Beyond 60 minutes there are very few trips. There are some yellow taxi trips with durations of less than zero minutes. This is not possible, so these are likely erroneous records, for example if the meter malfunctioned. As a result, values below the 0.1% quantile and above the 99.9% quantile can be removed from the data. This can also help in eliminating some outliers. Before doing this, however, let’s take a look at the trip speeds.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 7.89 10.78 13.99 14.56 59580.00
The taxi speeds go up to around 50mph. Beyond that, trips at even greater speeds are sporadic, but not impossible. 75% of trips average speeds under 15mph. However, the summary statistics indicate that some taxis in New York are traveling at impossible speeds, so this unreliable data should be removed from the dataset. To make the speeds more realistic, remove the values above the 99.9% quantile.
The 99.9% quantile corresponds to a speed of around 55mph. This cutoff might eliminate too much valuable data, as some speeds above it might be perfectly valid and might come from longer (distance) trips. Therefore we first need to check other factors associated with these trips in order to determine what data to eliminate. This will be done via bivariate analysis.
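The speed check can be sketched as follows. The rows are hypothetical, and treating a zero duration as a missing speed is my own choice here rather than something the original analysis states.

```r
# Hypothetical trips: distance in miles, duration in minutes.
trips <- data.frame(
  trip_distance = c(1.8, 3.4, 10.0, 0.5, 2.0),
  trip_duration = c(11.3, 18.4, 20.0, 0.0, 0.002)
)

# Speed in mph; a zero duration would give Inf, so mark it NA instead.
trips$trip_speed <- ifelse(
  trips$trip_duration > 0,
  trips$trip_distance / (trips$trip_duration / 60),
  NA_real_
)

# Cutoff at the 99.9% quantile, ignoring the undefined speeds.
cutoff  <- quantile(trips$trip_speed, probs = 0.999, na.rm = TRUE)
flagged <- trips[!is.na(trips$trip_speed) & trips$trip_speed > cutoff, ]
```

Flagging rather than deleting keeps the option open to inspect these trips in the bivariate analysis before deciding what to remove.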
##
## Cash Credit card Dispute No charge Unknown
## 88177 134934 306 757 1
##
## Group ride JFK Nassau or Westchester
## 5 4549 83
## Negotiated fare Newark Standard rate
## 1255 389 217892
## Unknown
## 2
##
## 0 1 2 3 4 5 6 7 9
## 72 161332 30506 9046 4132 11645 7440 1 1
##
## 0 1 2 3 4 5 6 7 8 9 10 11
## 9202 7019 5188 3808 2751 2452 4776 7849 9791 10037 9846 10207
## 12 13 14 15 16 17 18 19 20 21 22 23
## 10778 10518 11112 10603 9461 11212 13304 13831 13017 13248 12850 11315
##
## Mon Tue Wed Thu Fri Sat Sun
## 38183 24885 39827 34340 29572 28080 29288
Credit card and cash payments dominate by making up 60% and 39% of the payment types respectively. Less than 1% of the payments are disputed or considered no charge.
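The percentages quoted above can be reproduced directly from the printed counts. A minimal sketch using base R's `prop.table`:

```r
# Payment type counts as printed in the table above.
payment_counts <- c(Cash = 88177, `Credit card` = 134934,
                    Dispute = 306, `No charge` = 757, Unknown = 1)

# Share of each payment type, as a percentage of all trips.
payment_pct <- round(100 * prop.table(payment_counts), 1)
payment_pct
```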
The majority of New Yorkers, 97%, took trips at the standard rate. This was followed by trips to JFK at 2% and trips where the fare was negotiated at less than 1%.
New Yorkers are pretty solitary people, at least when it comes to taxi travel. A whopping 72% of the trips were for a single passenger only. This was followed by 2, 5, 3 and 6 passengers at 14%, 5%, 4%, and 3% respectively. It is interesting to note that there are 72 trips with 0 passengers. Does this mean the taxi was waiting for a passenger with the meter running, but the trip did not take place in the end? It will be interesting to check the time and distance for these trips.
New Yorkers use more taxis during the evening rush hour (5-8pm), at 23% of trips, than during the morning rush hour (7-10am), at 17%. In fact the evening rush hour is sustained right through to a night time rush hour (9pm-12am), and the morning rush hour lasts well until 4pm. Traffic only really falls in the early hours of the morning (1-6am). It might be best to group the hour of day into bands, but from this data 7am-12am can be considered rush hour traffic in New York City.
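The rush hour shares can be checked against the hourly counts printed above. In this sketch `share_of_trips` is a hypothetical helper, and I am assuming that "5-8pm" means pickups in hours 17 through 20 and "7-10am" means hours 7 through 10.

```r
# Pickup counts by hour of day (0-23), as printed in the table above.
hour_counts <- c(9202, 7019, 5188, 3808, 2751, 2452, 4776, 7849,
                 9791, 10037, 9846, 10207, 10778, 10518, 11112, 10603,
                 9461, 11212, 13304, 13831, 13017, 13248, 12850, 11315)
names(hour_counts) <- 0:23

# Hypothetical helper: share of all trips picked up in the given hours.
share_of_trips <- function(hours) {
  100 * sum(hour_counts[as.character(hours)]) / sum(hour_counts)
}

evening_pct <- share_of_trips(17:20)  # pickups from 5pm through the 8pm hour
morning_pct <- share_of_trips(7:10)   # pickups from 7am through the 10am hour
```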
I thought the weekend days might prove more popular than the weekdays on average; however, this is only partially true. The two main nights for entertainment, Friday and Saturday, have 26% of the weekly trips. If we add in Sunday we get 39% of the trips occurring over these three days. It will be interesting to see if the peaks occur in the night time weekend traffic.
There are 224,177 observations in the data, which form a 1.5% sample of all taxi data for the month of May 2015. The dataset originally had 20 variables. Numerical features include: vendorid, pickup_datetime, dropoff_datetime, passenger_count, trip_distance, pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude. In addition it has the following payment fields (in dollars): fare_amount (calculated by the meter and dependent on time and distance), extra, mta_tax, improvement_surcharge, tip_amount (credit cards only), tolls_amount, and total_amount.
The categorical features are as follows:
vendorid
store_and_fwd_flag
ratecodeid: 1 = Standard rate, 2 = JFK, 3 = Newark, 4 = Nassau or Westchester, 5 = Negotiated fare, 6 = Group ride
payment_type: 1 = Credit card, 2 = Cash, 3 = No charge, 4 = Dispute, 5 = Unknown, 6 = Voided trip
trip_type: 1 = Street-hail (green only), 2 = Dispatch (green only), 3 = No Info (created for yellow only)
197,374 observations are for yellow taxis and 26,803 for green. Most taxi trips are short, with a median total_amount of $12.30 and 75% of trips costing $18.30 or less. This is also reflected in the distance, as the median trip distance is 1.8 miles and 75% of trips go no further than 3.4 miles.
The main feature of the dataset is fare_amount, as it is a function of the time and distance taken for the trip.
Other interesting features are payment_type and ratecodeid, as these can also affect the amount paid for the trip. Pickup location (longitude and latitude) is also interesting for determining the origin of these trips and whether it has any effect on the type of trip the customer will take, and therefore on the total_amount. The time of day and day of the week are also important in determining the total_amount paid, as Friday, Saturday and Sunday have the least traffic on average (39% of trips), and the city appears to have a 7am-1am rush hour.
I created trip_duration in minutes by finding the time interval between pickup_datetime and dropoff_datetime. I also created some time series information extracted from both the pickup and dropoff datetimes using the lubridate package. This results in the new fields year, month, day, hour, minute, second, yday (day of the year), and wday (day of the week) for both pickup and dropoff. In addition, an id field was created to give each observation its own unique identifier. A type field was created to distinguish yellow taxis from green taxis. A field called trip_speed, in miles per hour (mph), was not added to the dataset but was created and used to determine whether some of the underlying distance and duration data was realistic.
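The feature creation described above was done with lubridate. The sketch below uses base R as a stand-in so that it runs without extra packages, and the two timestamps are hypothetical.

```r
# Hypothetical pickup and dropoff timestamps (UTC keeps the example
# independent of the local time zone).
pickup  <- as.POSIXct("2015-05-01 18:05:00", tz = "UTC")
dropoff <- as.POSIXct("2015-05-01 18:20:30", tz = "UTC")

# Trip duration in minutes.
trip_duration <- as.numeric(difftime(dropoff, pickup, units = "mins"))

# Date parts for the pickup time; base R stands in for lubridate here.
parts <- as.POSIXlt(pickup)
pickup_hour <- parts$hour      # 0-23
pickup_yday <- parts$yday + 1  # day of the year, 1-based
pickup_wday <- parts$wday      # base R convention: 0 = Sunday ... 6 = Saturday
```

Base R numbers weekdays from 0 for Sunday, so the values may need shifting if they are to be compared with fields produced by another package.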
The distributions were standard exponential distributions for all of the numerical fields. Taking the logs of some of them resulted in approximately normal distributions, i.e. the original values are close to log-normal. There are, however, outliers for all of the numerical features. For example the maximum total_amount paid is $406, while the maximum distance is 300 miles and the max tip_amount is $386.
I created trip_speed to investigate observations where the trip speed might be unrealistic. These speeds were removed from the dataset. As discussed above, the lubridate package was used to create time series data from dates. Most of the categorical data is in integer format. I created factors from these numbers and assigned user friendly labels, e.g. for the payment_type and ratecodeid fields. I also dropped the unused ehail-fee from the green taxi data and converted all column names to matching lowercase so that they can be stored in one data frame. There are still outliers in the data and data that might be considered errors, but before making the decision to remove or adjust this data I will need to do a bivariate and multivariate analysis.
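Turning the integer codes into labeled factors might look like the following sketch. The codes in `payment_code` are made up; the labels follow the payment_type code list given earlier in the text.

```r
# Hypothetical integer codes as stored in the raw data.
payment_code <- c(1, 2, 2, 1, 4, 3)

# Labels follow the code list given earlier in the text.
payment_type <- factor(
  payment_code,
  levels = 1:6,
  labels = c("Credit card", "Cash", "No charge",
             "Dispute", "Unknown", "Voided trip")
)

table(payment_type)
```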
As expected there is a strong correlation between fare_amount and trip_distance at 0.942, and to a far weaker extent with trip_duration at 0.193. This suggests that not many taxis were stuck in traffic, as the rates change from distance based to time based if the taxi is not moving fast enough. The correlation between total_amount and fare_amount, at 0.982, is such that any analysis can be done on the fare_amount and still lead to a good approximation of the total.
There is also a weak correlation between passenger_count and trip_distance at -0.014, and between fare_amount and location (pickup longitude and latitude) at ±0.01. However, longitude and latitude might not be the best representation of location.
Based on the results of the univariate analysis, we take a look at some of the relationships between variables.
There is a linear relationship between fare_amount and trip_distance. However, given the high correlation between fare and distance, it is interesting to note that yellow taxis make twice as much money: $100 in general at up to 30-35 miles, vs. green taxis making up to $50 for up to 25 miles. The green taxis cover a smaller distance (a difference of 5-10 miles) and make half the money. I wonder if this depends on the type (category) of trips being made.
Interesting points to note on the plots:
Vertical line at distance = 0 miles for varying fares (yellow and green).
Horizontal line at fare = $0 for varying distances (yellow and green).
Horizontal line at around $52. There is a flat rate of $52 from anywhere in Manhattan to JFK airport, which accounts for this line. However, the green taxis do not seem to make these trips.
Looking at the relationship between fare_amount and trip_distance, it is clear the yellow taxis make more money on average up to around 21 miles and $52. After this, green taxis make more money up to 35 miles, with an average fare_amount of $98 vs. $78 for yellow, a whopping $20 difference. Beyond 35 miles, however, there is no data for green taxis. What kind of trips are occurring between 0 and 21 miles, and from 21 to 35 miles? To answer this question a multivariate analysis was done by ratecodeid.
The difference in fare is mainly due to the standard rate above 25 miles. Green taxis also earn more in average fare for Nassau/Westchester trips from 3-20 miles and Newark trips from 15-25 miles.
What kind of relationship exists between fare_amount and trip_duration?
##
##
## Correlation between fare and trip duration
## [1] 0.2317463
The relationship between fare_amount and trip_duration does not appear linear, as shown by the 0.193 correlation from the scatter matrix and by the decreasing fares as trip duration goes beyond 40 minutes. This is also due to the vertical trend at trip_duration = 0 and the horizontal line for trips to JFK airport at $52 (yellow taxis only).
However, based on taxi operations, fare is directly related to distance and duration. Therefore distance is directly related to duration, and the relationship should be more linear. To help eliminate some of what might be error values I removed trip_duration values below the 1% quantile and above the 99.95% quantile, a total of 44 observations.
##
##
## Correlation between fare and trip distance
## [1] 0.9418471
##
##
## Correlation between fare and trip duration
## [1] 0.8658479
##
##
## Correlation between trip distance and duration
## [1] 0.7817089
The correlation between fare and distance remains strong at 0.94. However, once the outliers and potential error data were removed, the correlation between fare and trip duration improved from 0.193 to 0.87. This agrees with the other analysis done, as fare_amount is based on time and distance, and the following can be stated:
A linear relationship exists between fare_amount and trip duration, which confirms the relationship between fare_amount and time and distance.
New York City appears to have a rush hour lasting from 7am-1am. This implies that time-based rates are charged during these hours and hence fare and duration should have a linear relationship.
Trip distance and trip duration are themselves highly correlated (multicollinearity), hence only one of them will be used in any model I build.
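The trimming and re-checking of the correlation described above can be sketched as below. `trim_by_quantile` is a hypothetical helper, and the data is synthetic with a perfectly linear fare, so it only illustrates the mechanics, not the actual improvement seen in the taxi data.

```r
# Hypothetical helper: keep rows whose value in `col` lies between two quantiles.
trim_by_quantile <- function(df, col, lower = 0.01, upper = 0.9995) {
  bounds <- quantile(df[[col]], probs = c(lower, upper), na.rm = TRUE)
  keep <- !is.na(df[[col]]) & df[[col]] >= bounds[1] & df[[col]] <= bounds[2]
  df[keep, , drop = FALSE]
}

# Synthetic example: durations of 1..1000 minutes and an arbitrary linear fare.
trips <- data.frame(trip_duration = 1:1000)
trips$fare_amount <- 2.5 + 0.5 * trips$trip_duration

trimmed <- trim_by_quantile(trips, "trip_duration")
fare_duration_cor <- cor(trimmed$fare_amount, trimmed$trip_duration)
```

The same helper could be applied to trip_distance, although only one of the two would go into a model given how strongly they are correlated.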
At this stage we know that credit cards and cash dominate payment types (99%), and standard rate trips dominate ratecodeid (97%). What do the median values for fare_amount look like when split across the values of each of these categories?
## taxidata$payment_type: Cash
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -5.00 6.50 9.00 12.15 14.00 253.50
## --------------------------------------------------------
## taxidata$payment_type: Credit card
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 7.0 10.5 13.8 16.0 280.0
## --------------------------------------------------------
## taxidata$payment_type: Dispute
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -52.00 4.00 7.50 10.48 13.50 64.50
## --------------------------------------------------------
## taxidata$payment_type: No charge
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -75.00 4.50 9.50 14.05 17.50 100.00
## --------------------------------------------------------
## taxidata$payment_type: Unknown
## NULL
Credit card fares are $10.50 on average, which is higher than cash fares at $9.00. Note that credit card payments include tips and cash payments do not. The fares that are disputed or result in no charge are on average $7.50 and $9.50 respectively. These are very low fares to dispute, and likewise the no charge trips probably occurred for reasons other than price in some cases.
The minimum value for cash payments is -$5. If a payment is categorized as Cash, should we have a negative fare_amount? Perhaps these are errors and the data can be adjusted accordingly. Dispute and No charge also have negative fares that could be due to refunds, but these will also be removed from the dataset as this cannot be verified.
There are quite a number of outliers for both cash and credit cards. This suggests that while New Yorkers favor short trips there are still opportunities for taxis to make money on longer trips.
## taxidata$ratecodeid: Group ride
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.50 3.25 4.00 4.00 4.75 5.50
## --------------------------------------------------------
## taxidata$ratecodeid: JFK
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -52.00 52.00 52.00 51.91 52.00 60.89
## --------------------------------------------------------
## taxidata$ratecodeid: Nassau or Westchester
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.50 34.38 56.25 59.60 78.88 144.50
## --------------------------------------------------------
## taxidata$ratecodeid: Negotiated fare
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -75.00 8.00 17.00 35.13 52.00 280.00
## --------------------------------------------------------
## taxidata$ratecodeid: Newark
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 20.00 62.00 66.00 66.87 72.00 117.50
## --------------------------------------------------------
## taxidata$ratecodeid: Standard rate
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -11.00 6.50 9.50 12.19 14.50 113.00
## --------------------------------------------------------
## taxidata$ratecodeid: Unknown
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.54 20.66 29.77 29.77 38.88 48.00
As expected, the longer distance trips and JFK airport trips make the most money on average. Group rides, however, are not lucrative and don’t appear to occur very often, as the 1st quartile, median and 3rd quartile differ by only $1.50. While standard rate trips are 97% of all trips, they have a very low median fare of $9.50, but also a wide range of values and outliers up to $113.
I will assume fares less than $0 are errors (maybe from a malfunctioning meter) and remove them from the dataset.
Regardless of when a trip is taken the fare hovers between $9-$11.
The focus here was on fare_amount as this is the key variable determining total trip costs. A scatter matrix was done and the correlation between total_amount and fare_amount is 0.982. This indicates that any analysis can be done on the fare_amount and still lead to a good approximation of the total.
The strong correlation between fare_amount and trip_distance was confirmed via graphs. However, a few interesting points were noted:
What appeared to be a weak correlation between fare_amount and trip_duration, which frankly did not make any sense, was shown to be misleading. An investigation was done into why plots of fare_amount vs. trip_duration showed varying fares up to $200 for trip durations at or around 0 minutes. These values were removed as part of the lowest 1% of trip duration data and are considered possible meter malfunction errors. A reassessment of the fare and duration correlation showed that the relationship is indeed linear.
7am-1am is rush hour traffic, as shown in the bar graph of pickup hour (pckhour). If this is the case, then taxis should be constantly stuck in traffic and the rates switched from distance based ($0.50 per 1/5 mile) to time based ($0.50 per 60 seconds). Given the number of short trips, passengers should be paying mainly time based rates. The exceptions of course are JFK flat rates, trips taken between 1am-6am, and trips going longer distances (Newark, Nassau/Westchester).
The relationship between fare_amount, or more specifically negative fare amounts, and ratecodeid. This implied that meter malfunction errors occurred. This data was removed.
Confirmation of the relationship between fare_amount and time and distance. This also by default validated the pickup hour (pckhour) data from the univariate analysis section. The city has about 18 hours of rush hour traffic and hence a lot of these trips should be charged at time based rates. This in turn should be reflected in a highly correlated relationship between fare_amount and trip_duration, which was shown to have a value of 0.87.
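To make the distance based and time based rates mentioned above concrete, here is a deliberately simplified sketch. `metered_increments` is a hypothetical helper that only applies the two unit rates quoted in the text; the initial charge, surcharges, tips and tolls are left out, and the split of a trip into a moving part and a slow part is an assumption.

```r
# Unit rates quoted in the text; the initial charge, surcharges, tips and
# tolls are deliberately left out of this simplified sketch.
rate_per_fifth_mile <- 0.50  # dollars per 1/5 mile while moving
rate_per_minute     <- 0.50  # dollars per 60 seconds in slow traffic

# Hypothetical helper: metered increments for a trip split into a moving
# part (miles) and a slow or stopped part (minutes).
metered_increments <- function(moving_miles, slow_minutes) {
  distance_charge <- rate_per_fifth_mile * moving_miles * 5
  time_charge     <- rate_per_minute * slow_minutes
  distance_charge + time_charge
}

metered_increments(moving_miles = 2, slow_minutes = 4)
```

Under these rates a trip that spends more of its time in slow traffic accrues more of its fare from the time component, which is why fare and duration would be expected to move together.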
Let’s take a look at all dollar values and how they vary with each other and with distance.
Total amount paid is mainly due to the base fare, tips and toll charges. These values can be included in any model. The effect of other charges is minimal. This is also reflected in the information at http://www.nyc.gov/html/tlc/html/passenger/taxicab_rate.shtml as these are flat charges regardless of distance traveled. It is also interesting to note that the higher the fare, the bigger the tip.
The base fares make up 75% or more of the final amount paid. Tips are in general around 15-16%. Toll charges are $0 for shorter distances, but from about 16 miles they can go up to 8-9%. At around 46 miles, and just under 5%, tolls again start to trend downwards.
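The shares above are each component divided by the final total. A minimal sketch with two made-up trips:

```r
# Hypothetical trips; column names follow the dataset described in the text.
trips <- data.frame(
  fare_amount  = c(10.0, 52.0),
  tip_amount   = c(2.0, 10.0),
  tolls_amount = c(0.0, 5.54),
  total_amount = c(13.3, 68.34)
)

# Each component as a percentage of the final amount paid.
components <- c("fare_amount", "tip_amount", "tolls_amount")
shares <- round(100 * trips[components] / trips$total_amount, 1)
shares
```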
Let’s take a closer look at fares, tips and tolls, as these are the three dominant factors in the final total.
All categories of cash and credit card payments (JFK the exception) and the standard rate category for rate codes maintain the linear trend between fare_amounts with increasing distance. The vertical line around 0 miles occurs mainly for credit card payments. This could imply that the odometer was not working for these trips. It appears as if there is not much difference in fare_amount between cash and credit paying customers.
We only have tip information for credit card customers.
More tips are paid by credit card customers for JFK and Newark rates. This confirms what we saw previously with the line graphs that the longer the distance the more likely a customer is to pay tips.
Now that we have some insight into the amount customers pay by payment type and rate code, let’s look at how these categories are affected by pickup location.
This shows that some of the data is definitely error observations, as it is impossible for the taxis to drive to location (0,0) and the other cluster around 57 degrees latitude. Remove these errors from the data and most of the observations where the meter was suspected of malfunctioning would likely disappear.
Removing points at (0,0), in the middle of the ocean, and somewhere in Georgia, leaves the remaining points in the states of New York, New Jersey and Connecticut.
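One simple way to drop such points is a bounding box around the region of interest. The limits below are illustrative assumptions, not the values used in the original analysis, and the coordinates are made up.

```r
# Hypothetical pickup coordinates, including a (0, 0) record.
trips <- data.frame(
  pickup_longitude = c(-73.98, 0.00, -73.78, -73.95),
  pickup_latitude  = c(40.75, 0.00, 40.64, 57.10)
)

# Assumed bounding box around the New York metropolitan area; the limits
# are illustrative and not taken from the original analysis.
lon_range <- c(-75.0, -72.5)
lat_range <- c(40.0, 42.0)

in_box <- trips$pickup_longitude >= lon_range[1] &
  trips$pickup_longitude <= lon_range[2] &
  trips$pickup_latitude >= lat_range[1] &
  trips$pickup_latitude <= lat_range[2]

valid_trips <- trips[in_box, ]
```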
However, we have another problem, as this plot is not very useful for any type of analysis. The data looks like one big cluster at a single location due to the nature of longitude and latitude coordinates. I will change location to something more meaningful: boroughs.
## OGR data source with driver: GeoJSON
## Source: "nyc_boroughs.geojson", layer: "OGRGeoJSON"
## with 104 features
## It has 3 fields
## [1] NA
## [1] "+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0"
Add coordinates for the center of each borough
##
## Bronx Brooklyn Manhattan Queens
## 0.85 6.47 84.28 8.34
## Staten Island Outside Boroughs
## 0.00 0.06
We have already determined that the most important prices are fare amount, tips, tolls, and total amount.
## [1] "Median fares by borough"
## Source: local data frame [6 x 3]
##
## borough fare n
## (fctr) (dbl) (int)
## 1 Bronx 10.0 1870
## 2 Brooklyn 11.5 14159
## 3 Manhattan 9.5 184408
## 4 Queens 23.0 18244
## 5 Staten Island 20.0 2
## 6 Outside Boroughs 13.5 129
## [1] "Median fare for entire dataset"
## [1] 10
On average the highest fares are made when the pickup location is in Queens, at $23, and the lowest in Manhattan, at $9.50. We know that Manhattan accounts for 84% of all taxi traffic. The overall average for taxis is $10.00. It is likely that a driver stands to make more money per trip for pickup locations in Queens. This is even more notable as there are 10 times more trips taking place in Manhattan than in Queens. Manhattan customers are not traveling as far on average, which can account for the low fares, but what accounts for the high fares starting in Queens? JFK is located in Queens, and with a flat rate of $52 this could cause the increase seen in average earnings for these trips. Trips originating outside of the city limits have fares of only $13.50 on average. This is likely due to the longer distances.
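A per-borough median like the one in the table above can be computed in several ways. The sketch below uses base R's `tapply` on made-up rows; the printed output suggests the original used a grouped data frame summary instead.

```r
# Hypothetical trips with a pickup borough already assigned.
trips <- data.frame(
  borough     = c("Manhattan", "Manhattan", "Manhattan", "Queens", "Queens"),
  fare_amount = c(8.0, 9.5, 12.0, 23.0, 52.0)
)

# Median fare and number of trips per pickup borough.
median_fare <- tapply(trips$fare_amount, trips$borough, median)
n_trips     <- table(trips$borough)
```

Keeping the trip count next to each median matters here, since a borough with only a handful of trips can show a misleadingly high or low value.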
##
## Cash Credit card Dispute No charge Unknown
## 85751 132310 220 531 0
## numeric(0)
## Source: local data frame [6 x 4]
##
## fare payment_type borough n
## (dbl) (fctr) (fctr) (dbl)
## 1 8.5 Dispute Bronx 7
## 2 7.5 Dispute Brooklyn 17
## 3 8.0 Dispute Manhattan 163
## 4 13.5 Dispute Queens 33
## 5 NA Dispute Staten Island 0
## 6 NA Dispute Outside Boroughs 0
Staten Island has the second highest fare_amount for credit card paying customers at $28, but on closer inspection only 1 customer falls into this category. Any realistic interpretation of this analysis should reject small sample sizes, as these will only lead to misleading conclusions. To aid the viewer, sample sizes are indicated by the colour of the borough.
Queens customers pay the highest fare overall at $30.50 and use credit cards to do so. This is two and a half times that paid on average by Brooklyn customers at $12.50 and twice that paid by those with pickup points outside of the boroughs. It is likely that JFK customers prefer to pay by credit card. Queens also has the highest paying cash customers at $13. According to the table above, Queens customers also have the highest disputed fares at $13.50, while Manhattan has the highest number of disputed trips at 163 vs. Brooklyn with 17 and Queens with 33. This is expected as 84% of the taxi traffic originates from Manhattan.
What accounts for the high fares paid by Queens customers? Could it be associated with the rates or the distance travelled? We already know most New Yorkers take short trips of up to 5 miles, so are these Queens customers the ones going longer distances, or are they JFK customers? Let’s look at these observations by ratecodeid to verify.
## fare payment_type ratecodeid borough n
## 60121 52.00 Credit card JFK Queens 1549
## 62569 55.50 Credit card Nassau or Westchester Queens 25
## 65017 70.00 Credit card Negotiated fare Queens 60
## 67465 106.75 Credit card Newark Queens 4
## 69913 26.50 Credit card Standard rate Queens 7544
84% of all trips are made from Manhattan. Of these, 98% are made using the standard rate or made to Newark. The standard rate average fares of $8.50 and $10 for cash and credit card payments do not differ very much from the overall taxi trip average of $10. However, as shown above, the same cannot be said for Queens, where cash customers pay $1 more on average and credit card customers pay more than 2.5 times the average at $26.50. A standard rate trip from Queens earns about half the JFK fare per trip, but with almost 5 times the number of customers. Who are these standard rate Queens customers, roughly 7,500 of them compared to Manhattan customers in the 100,000s, who are nevertheless paying far more money for taxi trips? This could indicate that these standard rate customers are taking longer trips or are caught in bad traffic.
JFK customers number only 1,549 customers from Queens for an estimated total $80,548 in fares vs. standard rate customers from the same borough with $199,916 in estimated fares.
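The estimated totals above are simply the number of trips multiplied by the median fare, using the Queens credit card rows from the table:

```r
# Trip counts and median fares for Queens credit card customers,
# taken from the table above.
jfk_trips      <- 1549
jfk_fare       <- 52.00
standard_trips <- 7544
standard_fare  <- 26.50

# Estimated revenue: number of trips times the median fare.
jfk_total      <- jfk_trips * jfk_fare
standard_total <- standard_trips * standard_fare
```

Multiplying by the median is only a rough estimate of revenue, since the actual sum depends on the full distribution of fares.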
Even for taxis going to Nassau/Westchester or Newark from Queens, fares are just above/below twice that for trips originating in Manhattan.
This will verify whether the increased average fares for Queens customers are affected more by the 18-hour New York City rush hour traffic. To do this analysis I created a categorical variable from the pckhour field and labeled it appropriately. Intervals were chosen based on the univariate analysis of pckhour.
Early Morning = 2am - 6am
Morning Rush Hour = 7am - 10am
Afternoon Rush Hour = 11am - 3pm
Evening Rush Hour = 4pm - 7pm
Nighttime Rush Hour = 8pm - 10pm
Late Night = 11pm - 1am
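One way to build such a categorical variable is a lookup vector indexed by hour. In this sketch `band_by_hour` and `time_of_day` are hypothetical names, and the labels follow the ones that appear in the printed output.

```r
# Time-of-day label for each pickup hour 0-23, following the intervals above.
band_by_hour <- c(
  rep("Late Night", 2),           # 12am, 1am
  rep("Early Morning", 5),        # 2am-6am
  rep("Morning Rush Hour", 4),    # 7am-10am
  rep("Afternoon Rush Hour", 5),  # 11am-3pm
  rep("Evening Rush Hour", 4),    # 4pm-7pm
  rep("Nighttime Rush Hour", 3),  # 8pm-10pm
  "Late Night"                    # 11pm
)

# Hypothetical helper: map pickup hours to a factor with levels in
# chronological order.
time_of_day <- function(hour) {
  factor(band_by_hour[hour + 1],
         levels = c("Early Morning", "Morning Rush Hour", "Afternoon Rush Hour",
                    "Evening Rush Hour", "Nighttime Rush Hour", "Late Night"))
}

time_of_day(c(0, 3, 8, 12, 17, 21, 23))
```

A lookup vector handles the band that wraps around midnight (11pm to 1am) without any special casing.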
## [1] "AVERAGE FARE AMOUNT FOR BY TIME OF DAY"
## taxidata$time_of_day: Early Morning
## [1] 10
## --------------------------------------------------------
## taxidata$time_of_day: Morning Rush Hour
## [1] 10
## --------------------------------------------------------
## taxidata$time_of_day: Afternoon Rush Hour
## [1] 9.5
## --------------------------------------------------------
## taxidata$time_of_day: Evening Rush Hour
## [1] 10
## --------------------------------------------------------
## taxidata$time_of_day: Nighttime Rush Hour
## [1] 9.5
## --------------------------------------------------------
## taxidata$time_of_day: Late Night
## [1] 9.5
##
## ---------------------------------------------
Let’s keep the focus on Manhattan and Queens due to the fact that Manhattan has the most trips in the dataset, yet trips originating in Queens earn higher fares.
The overall average fare is between $9.50 and $10 and does not vary by time of day. However, if we look at the figure we see that trips from Queens make the most money during any period of the day. This is especially true for credit card customers in the Early Morning and from the Afternoon Rush Hour to Late Night ending at 1am, where the average fare varies from $21 to $30.50 and is up to 3 times that of trips originating in Manhattan. Morning Rush Hour customers pay $3 more on average than Manhattan customers, at $13. Cash paying customers, on the other hand, pay about $2.50-$5.50 more for all time periods except Early Morning and Morning Rush Hour, where the earnings are comparable with Manhattan customers and the overall time period averages.
In fact median fares from Manhattan do not vary from the overall averages for each time of day category. This is expected, as trips from Manhattan make up 84% of the data and hence will dominate the medians.
No matter what time of day a taxi works it is likely to make money if it is a non-JFK trip originating in Queens.
This will confirm which is the most important factor for these standard rate Queens and Manhattan customers, time of day or distance traveled.
## [1] NA
## [1] "AVERAGE FARE AMOUNT TO DESTINATION BOROUGHS - STANDARD RATES"
## taxidata$destborough: Bronx
## [1] 14.5
## --------------------------------------------------------
## taxidata$destborough: Brooklyn
## [1] 14.5
## --------------------------------------------------------
## taxidata$destborough: Manhattan
## [1] 9
## --------------------------------------------------------
## taxidata$destborough: Queens
## [1] 18.5
## --------------------------------------------------------
## taxidata$destborough: Staten Island
## [1] 53
## --------------------------------------------------------
## taxidata$destborough: Outside Boroughs
## [1] 60
##
## ---------------------------------------------
##
## % of trips starting and ending in Manhattan
## [1] 90.68
##
## Total no. of trips starting in Manhattan
## [1] 184408
##
## % of trips starting and ending in Queens
## [1] 48.25
##
## Total no. of trips starting in Queens
## [1] 18244
On average, trips originating in Queens make more money than those starting in Manhattan for standard rate trips. The only exception is trips beginning and ending in Queens. From these graphs and tables, however, we can see the main reason trips originating in Manhattan on average make far less money than those originating in Queens. 91% of trips starting in Manhattan remain in Manhattan, or in other words are short trips. This compares with 48% of trips starting in Queens and remaining in Queens.
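The within-borough percentages above can be computed as the share of trips whose destination borough equals the pickup borough. A minimal sketch on made-up rows, with `pct_within` as a hypothetical helper:

```r
# Hypothetical trips with pickup and destination boroughs assigned.
trips <- data.frame(
  borough     = c("Manhattan", "Manhattan", "Manhattan", "Queens", "Queens"),
  destborough = c("Manhattan", "Manhattan", "Brooklyn", "Queens", "Manhattan")
)

# Hypothetical helper: percentage of trips from borough `b` that also end in `b`.
pct_within <- function(b) {
  from_b <- trips[trips$borough == b, ]
  round(100 * mean(from_b$destborough == b), 2)
}

pct_within("Manhattan")
pct_within("Queens")
```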
In addition, for longer distances, trips starting in Queens and going to other boroughs make from $14 up to $23 (for Outside Boroughs) more than those from Manhattan heading to the same destinations. The only exception to this “rule” is when the destination is Queens, since trips that also originate in Queens are then short trips.
Distance therefore plays a role, in addition to time of day, in the average amount earned per trip. The data shows that standard rate trips from Queens make more money per trip throughout the day when going to destinations outside Queens. We saw in the previous section that time of day is also a factor: except for cash customers in the Early Morning and Morning Rush Hour, fares for trips from Queens are up to 3 times those from Manhattan.
Finally, given that yellow taxis make up around 87% of this dataset, it can be concluded that most of these lower paying Manhattan-to-Manhattan trips, averaging $9, are made by yellow taxis.
Yellow (Medallion) taxis are concentrated in Manhattan but are allowed to pick up passengers anywhere in the five boroughs. Green (Boro) taxis are only allowed to pick up street-hail passengers in Upper Manhattan, the Bronx, Brooklyn, Staten Island and Queens (excluding LaGuardia and JFK airports). However, they can pick up passengers from the airports if the trip is pre-arranged. Green taxis can drop passengers off anywhere.
Green taxis were introduced in 2013 with the goal of improving access to street hail taxis and to serve areas traditionally underserved by the yellow taxis.
Both types of taxis are governed by the same rates if it is a street hail, but the base sets the rates if the trip is pre-arranged.
An interesting point to note is that the GPS tracker on green taxis does not allow the meter to work if the pickup location is in Manhattan outside of the Upper (northern) Manhattan area or at the airports. Could this be the reason why some trips have recorded coordinates of (0,0)?
Given that the base sets the rates for dispatched trips, let’s look at the earnings of the green and yellow taxis by trip_type. To continue from the last few sections, the focus will be on standard rate trips first in order to determine who is making these high earning standard rate fares.
## [1] "AVERAGE FARE AMOUNT BY BOROUGH"
## taxidata$borough: Bronx
## [1] 10
## --------------------------------------------------------
## taxidata$borough: Brooklyn
## [1] 11.5
## --------------------------------------------------------
## taxidata$borough: Manhattan
## [1] 9.5
## --------------------------------------------------------
## taxidata$borough: Queens
## [1] 23
## --------------------------------------------------------
## taxidata$borough: Staten Island
## [1] 20
## --------------------------------------------------------
## taxidata$borough: Outside Boroughs
## [1] 13.5
##
## ---------------------------------------------
Legend: Street-hail-Y = yellow taxis; Street-hail-G = green taxis; Dispatch = green taxis.
The data is divided by trip_type. Yellow taxis are street-hail taxis only, while green taxis can be either street-hail or dispatch. The graphic therefore also gives some insight into the breakdown of average fare_amount by yellow and green taxis.
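A minimal sketch of how the three trip_type labels and a per-label average could be derived is shown here. The columns taxi_type and hailed are hypothetical helpers, the fares are made up, and the median is used only because the write-up refers to median fares; none of this is taken from the original code.

```r
# Toy trips; taxi_type and hailed are hypothetical helper columns
trips <- data.frame(
  taxi_type   = c("yellow", "yellow", "green", "green", "green", "green"),
  hailed      = c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE),
  fare_amount = c(9, 11, 7, 9, 10, 16)
)

# Label each trip the way the legend does
trips$trip_type <- ifelse(
  !trips$hailed, "Dispatch",
  ifelse(trips$taxi_type == "yellow", "Street-hail-Y", "Street-hail-G")
)

# Median fare per trip type
median_fare <- tapply(trips$fare_amount, trips$trip_type, median)
print(median_fare)
```

Because only green taxis can be dispatched, the Dispatch label needs no colour suffix.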
Yellow taxis make the highest average fares from trips originating in Queens: $31.50 for credit card customers, compared to $11 for green taxis (street-hail) and $9.50 (dispatched). Likewise, for cash-paying customers, yellow taxis average $26 versus $8.50 for green taxis (street-hail) and $8 (dispatched). Yellow taxis may be based in Manhattan, but their standard rate trips originating in Queens earn 3 times the $10 average for all taxis. Compare this to green taxis, which are based in Queens but earn a third of what yellow taxis earn in this borough.
There might be opportunities for green taxis to make money from dispatched trips originating in Brooklyn, where customers pay on average $15.50 and $17.50 for cash and credit card trips respectively.
Given the regulations set by TLC, there is nothing to stop green taxis going after these lucrative Queens customers.
Considering that JFK and LaGuardia are both in Queens, and that green taxis are prohibited from airport-based street hails, let’s take a look at the effect of JFK trips on these average fares.
Standard rates and JFK: yellow taxi earnings from Queens increased by a mere $4 (credit card) and $6 (cash) when JFK traffic was taken into consideration. Other boroughs showed little to no increase. This shows the dominance of standard rate earnings when a trip begins in Queens.
By the nature of the yellow taxi sector, all of these trips are street-hail. It is interesting that these high earnings do not stem from JFK trips (or Manhattan pickups), even though the regulations restricting green taxi street-hail pickups at the Queens-based LaGuardia and JFK airports and in Manhattan give yellow taxis a perceived advantage in these areas. In theory the yellow sector should earn more from JFK customers, yet its higher average earnings come from standard rate customers in the borough where green taxis are based.
Note that this sample contains only 68 JFK trips for green taxis. This suggests that JFK trips are also far fewer for the green sector in the original dataset, which ties in with the restrictions set by the TLC.
Green taxis earn above-average fares for street-hail credit card trips starting in Outside Boroughs ($26.25) and for dispatched trips from Brooklyn ($15.50 cash, $17.50 credit card). The sample sizes are small, fewer than 100 trips in each case. Compare this to the sample sizes for yellow taxis originating in Queens and Manhattan, which are in the order of 10,000s and 100,000s.
## [1] "AVERAGE FARE AMOUNT BY TRIP TYPE"
## taxidata$trip_type: Dispatch
## [1] 12
## --------------------------------------------------------
## taxidata$trip_type: Street-hail-G
## [1] 10
## --------------------------------------------------------
## taxidata$trip_type: Street-hail-Y
## [1] 10
##
## ---------------------------------------------
This confirms that rates for longer trips such as Newark and Nassau/Westchester have an effect on green taxis that have been dispatched. The sample sizes are small, of the order of 10s, and the issue will need further exploration with larger samples. Even so, this sector can almost triple its earnings from credit card customers for trips originating in Queens to $27.50, double its earnings to $33 for trips starting in Brooklyn, and now include Manhattan customers with earnings of $20 on average.
Contrast this with yellow taxis, where average fares remain the same when these distance rates are included in the trip profile. This confirms that the bulk of the higher yellow taxi earnings comes from standard rate trips originating in Queens, followed by JFK trips also originating in Queens.
Street-hail green taxi trips to Outside Boroughs also saw average credit card fares increase, from $26.25 to $32, when the longer distance rates are taken into consideration. Again the sample sizes are small, showing that the market being serviced is smaller.
It should also be noted that starting in January 2015, prices for yellow taxi trips dropped considerably due to competition from Uber and similar services.
Sources:
https://en.wikipedia.org/wiki/Taxicabs_of_New_York_City
http://www.nyc.gov/html/tlc/html/passenger/shl_passenger.shtml
http://www.nyc.gov/html/tlc/downloads/pdf/2016_tlc_factbook.pdf
Given the smaller number of green taxis in New York compared to yellow taxis, how can we compare the two using the current sample, including categories with very small sample sizes, and still draw a credible conclusion about which taxi is best to drive (by average fare earned), when and where? This can be achieved with a cluster analysis that segments all taxi activity.
In this analysis I created a new variable called weekday. Since an earlier univariate analysis found that lower traffic occurred on Friday, Saturday and Sunday, I decided to create a three-day weekend. The taxidata sample is unbalanced in favour of yellow taxis, at 87%. The cluster model was built by undersampling the yellow taxi data and limiting yellow taxi trips starting in Manhattan to 50% of the sample.
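The two preparation steps described above (the three-day weekend flag and the undersampling of yellow taxis) could look roughly like the sketch below. It is not the original code: the data is a toy stand-in, pickup_date and taxi_type are assumed column names, and the one-to-one balance used here is an assumption, since the exact sampling ratios are not stated.

```r
set.seed(123)  # make the random draw reproducible

# Toy stand-in for the taxidata sample (column names are assumptions)
trips <- data.frame(
  pickup_date = as.Date("2015-01-01") + ((0:999) %% 28),
  taxi_type   = rep(c("yellow", "green"), times = c(870, 130))
)

# Day of week as a number: 0 = Sunday, 1 = Monday, ..., 6 = Saturday
wday <- as.POSIXlt(trips$pickup_date)$wday

# Three-day weekend: Friday (5), Saturday (6) and Sunday (0) are not weekdays
trips$weekday <- ifelse(wday %in% c(5, 6, 0), "No", "Yes")

# Undersample yellow taxis down to the number of green taxi trips
green_idx   <- which(trips$taxi_type == "green")
yellow_idx  <- which(trips$taxi_type == "yellow")
keep_yellow <- sample(yellow_idx, size = length(green_idx))
balanced    <- trips[c(green_idx, keep_yellow), ]

print(table(balanced$taxi_type))
```

Using the numeric day of week rather than day names keeps the flag independent of the machine's locale.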
##
##
## % of each taxi sector
##
## green yellow
## 12 88
##
##
## Clusters sorted in descending order of average fare:
## group fare borough destborough trip_distance payment_type
## 15 15 52.0 Queens Manhattan, Queens 17.600 Credit card
## 7 7 19.0 Brooklyn Manhattan 4.925 Credit card
## 10 10 16.5 Queens Queens 4.450 Credit card
## 2 2 16.0 Queens Queens 3.825 Credit card
## 16 16 10.5 Manhattan Bronx 2.180 Credit card
## 4 4 10.0 Manhattan Manhattan 1.710 Credit card
## 14 14 10.0 Brooklyn Brooklyn 2.060 Cash
## 1 1 9.5 Queens Queens 1.820 Cash
## 6 6 9.5 Manhattan Manhattan 1.870 Credit card
## 12 12 9.5 Brooklyn Brooklyn 2.000 Credit card
## 13 13 9.5 Brooklyn Brooklyn 1.860 Credit card
## 11 11 9.0 Brooklyn Brooklyn 1.690 Cash
## 3 3 8.5 Manhattan Manhattan 1.490 Cash
## 9 9 8.5 Bronx Bronx 1.630 Cash
## 5 5 8.0 Manhattan Manhattan 1.500 Cash
## 8 8 8.0 Queens Queens 1.500 Cash
## ratecodeid trip_type weekday taxi_type n
## 15 JFK Street-hail No yellow 65
## 7 Standard rate Street-hail No green 174
## 10 Standard rate Street-hail No green 175
## 2 Standard rate Street-hail Yes green 144
## 16 Standard rate Street-hail Yes green 49
## 4 Standard rate Street-hail Yes yellow 900
## 14 Standard rate Street-hail No green 230
## 1 Standard rate Street-hail Yes green 243
## 6 Standard rate Street-hail No yellow 866
## 12 Standard rate Street-hail No green 264
## 13 Standard rate Street-hail Yes green 158
## 11 Standard rate Street-hail Yes green 157
## 3 Standard rate Street-hail Yes yellow 572
## 9 Standard rate Street-hail No green 129
## 5 Standard rate Street-hail No yellow 649
## 8 Standard rate Street-hail No green 225
##
##
## Silhouette of cluster results - closer average silhouette width is to 1, then the better the cluster algorithm performed.
##
##
## Average silhouette width is:
## [1] 0.6931765
I built the model using hierarchical clustering. A second round of sampling limited the data to 5,000 observations, because the distance matrix was otherwise too large for R and the available memory. This is where big data tools such as Hadoop have an advantage. To create the 5,000 observations I oversampled the smaller data subsets, e.g. when trip type is Dispatched.
The model was built using the fare_amount, trip distance, borough, destination borough, payment type, rate code id, trip type (e.g. Street-hail or Dispatched), and weekday/weekend variables. The taxi type (yellow or green) was added to the results after clustering to determine which sector was earning the average fare for each segment.
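A minimal sketch of this kind of model is given below, assuming the Gower dissimilarity that the write-up names for mixed numeric and categorical variables. The data is a deliberately simple toy set with two obvious segments, only four of the model variables are included, and the complete linkage method and k = 2 are illustrative assumptions rather than the settings used in the analysis.

```r
library(cluster)  # provides daisy() and silhouette()

set.seed(42)
n <- 20  # trips per toy segment

# Two deliberately well-separated toy segments (all values are made up)
toy <- data.frame(
  fare_amount   = c(rnorm(n, mean = 30, sd = 2), rnorm(n, mean = 9, sd = 1)),
  trip_distance = c(rnorm(n, mean = 10, sd = 1), rnorm(n, mean = 1.7, sd = 0.3)),
  borough       = factor(rep(c("Queens", "Manhattan"), each = n)),
  payment_type  = factor(rep(c("Credit card", "Cash"), each = n))
)

# Gower dissimilarity handles numeric and categorical columns together
gower_dist <- daisy(toy, metric = "gower")

# Agglomerative hierarchical clustering on the dissimilarity matrix
tree <- hclust(as.dist(gower_dist), method = "complete")

# Cut the dendrogram into k clusters (k = 2 suits this toy data)
groups <- cutree(tree, k = 2)

# Average silhouette width: values near 1 indicate well-separated clusters
sil <- silhouette(groups, dist = gower_dist)
avg_width <- mean(sil[, "sil_width"])
print(avg_width)

# Cross-tabulate clusters against a known label to inspect the segments
print(table(cluster = groups, borough = toy$borough))
```

The dissimilarity matrix grows with the square of the number of rows, which is consistent with the memory limit that forced the 5,000-observation sample.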
The model shows that, after JFK trips, trips starting in Queens and going to Manhattan for credit card paying, standard rate, street-hail, weekday customers make the most money on average at $38.50 (cluster 2). This coincides with the conclusions from the exploratory analysis. It also shows two other potential segments that were not obvious from the EDA: cluster 15, starting and ending in Queens on weekends, with cash-paying customers and an average fare of $25.75; and cluster 14, going from Manhattan to Brooklyn, also on the weekend, with credit card customers and an average fare of $19. All three money-making segments are in the yellow taxi sector.
It also shows the value of combining EDA with modeling: the number of variables considered simultaneously can be far greater than what can be plotted on a graph, which gives insight into groups that cannot be seen by visualizing 2-4 variables. For example, when I include time of day (Morning Rush Hour etc.) in the clustering, the model gives entirely different results. If time of day were important to the green sector in determining where it can earn revenue, then the recommendation (answer) would change.
Converting the pickup and dropoff longitude and latitude variables to New York City borough information changed the analysis of the dataset completely. Some of the insights gained are repeated or reinforced from the individual sections above:
84% of the trips start in Manhattan.
Despite the above, on average the highest fares are made when the pickup location is in Queens at $23 and the lowest in Manhattan at $9.50.
Payment type: Queens customers paying by credit card pay the highest fare overall at $30.50. This is twice that paid by those with pickup points outside of the boroughs and about 2.5 times the $12.50 paid on average by Brooklyn customers. Queens also has the highest paying cash customers at $13.
Rate code id: The standard rate average fares of $8.50 and $10 for cash and credit card payments do not differ much from the overall taxi trip average of $10. However, the same cannot be said for Queens, where cash customers pay $1 more on average and credit card customers pay more than 2.5 times the average at $26.50. Who are these standard rate Queens customers, who number in the 10,000s compared to Manhattan customers in the 100,000s, yet pay far more for taxi trips?
Time of day: Trips from Queens earn the most during every period of the day. This is especially true for credit card customers in the Early Morning and from the Afternoon Rush Hour to Late Night (ending at 1am), where the average fare ranges from $21 to $30.50 and is up to 3 times that of trips originating in Manhattan. Morning Rush Hour customers pay $13 on average, $3 more than Manhattan customers. Cash-paying customers in Queens, on the other hand, pay about $2.50-$5.50 more in all time periods except Early Morning and Morning Rush Hour, where earnings are comparable with Manhattan customers and the overall time period averages.
On average, standard rate trips originating in Queens make more money than those starting in Manhattan. The only exception is trips beginning and ending in Queens. In fact, 91% (167,229/184,408) of trips starting in Manhattan remain in Manhattan, in other words they are short trips. This compares with 48% (8,802/18,244) of trips starting in Queens that remain in Queens.
In addition, for longer distances, trips starting in Queens and going to other boroughs make from $14 up to $23 (for Outside Boroughs) more than those from Manhattan heading to other boroughs.
The data shows that standard rate trips from Queens throughout the day will make more money per trip when going to destinations outside of the Queens borough.
Finally, given that yellow taxis make up around 87% of this dataset, it can be concluded that most of these lower paying Manhattan-to-Manhattan trips, averaging $9, are made by yellow taxis.
Standard rates only
The high fares for standard rate trips are being made by yellow taxis. For credit card customers, they make the highest average fares from trips originating in Queens at $31.50, compared to $11 for green taxis (street-hail) and $9.50 (dispatched).
Likewise, for cash-paying customers, yellow taxis average $26 versus $8.50 for green taxis (street-hail) and $8 (dispatched). Yellow taxis may be based in Manhattan, but their trips originating in Queens earn 3 times the $10 average for all taxis. Compare this to green taxis, which are based in Queens but earn a third of what yellow taxis earn in this borough for standard rate traffic only.
There might be opportunities for green taxis to make money from dispatched trips originating in Brooklyn, where customers pay on average $15.50 and $17.50 for cash and credit card trips respectively. This is twice that paid by Queens customers.
Standard rates and JFK
Yellow taxi earnings from Queens increased by a mere $4 (credit card) and $6 (cash) when JFK traffic was taken into consideration. Other boroughs showed little to no increase. This shows the dominance of standard rate trips in the yellow taxi sector. In theory, the flat rate trips from JFK at $52 should have significantly improved the average fares.
Standard rates, JFK and longer distance rates: on including the fares for trips to Newark and Nassau/Westchester, green taxis experience a large jump in earnings. Rates for longer trips such as Newark and Nassau/Westchester have an effect on green taxis that have been dispatched. While the sample sizes are small, this sector can almost triple its earnings from credit card customers for trips originating in Queens to $27.50, double its earnings to $33 for trips starting in Brooklyn, and now include Manhattan customers with earnings of $20 on average. There is also some improvement in earnings from cash customers.
Contrast this with yellow taxis, where average fares remain the same when these distance rates are included in the trip profile.
Street-hail green taxi trips to Outside Boroughs also saw average credit card fares increase, from $26.25 to $32, when the longer distance rates are taken into consideration. Again the sample sizes are small, showing that the market being serviced is small.
Overall there is no difference between the average fares earned by yellow taxis and green taxis from street hails, at $10 per trip. Green taxis have a slight advantage when it comes to dispatch trips, at $12. However, when the originating borough is added to the mix, a much clearer picture emerges of where the highest fares are earned and by whom: yellow taxis starting in Queens on standard rate trips, for both credit card and cash-paying customers.
Based on the current sample, if green taxi drivers want to improve their revenue, they need to do one of three things.
Focus on increasing the market size of long distance credit card customers who use the dispatch services. This can be done in all boroughs, but with particular attention paid to Queens and Brooklyn.
Get more information about the standard rate customers serviced by yellow taxis in Queens. Some of this data is location related and is already available to drill down to the neighborhood level. In which neighborhoods are these pickups happening? It might also involve some primary market research by the green taxi sector.
While JFK trips have only increased yellow taxi revenue by $5-$6, this is a third option for green taxis. However, it would involve petitioning to get the current regulations changed. This might not be worth the smaller potential increase in fares compared to the more lucrative segments of long distance rates and standard rate trips originating in Queens.
The effect of credit card paying, standard rate, yellow taxi customers originating in Queens on the average fare_amount. This is very surprising considering yellow taxis dominate the JFK market and have the backing of TLC regulations to support it. One would therefore expect the JFK segment to have a larger effect on average fares, rather than the mere increase of $5-$6 above the standard rate fare averages.
Given that we are looking for the best customer segments for taxi earnings, a cluster model was created. Based on the exploratory analysis conducted, the clustering included the following fields: weekday, borough, destborough, payment_type, ratecodeid, trip_distance, trip_type, and fare_amount.
Hierarchical clustering was done on a sample of the data due to the limitations of memory on my machine.
Strengths
1. The original data is unbalanced, with only 14% of observations being green taxis. To balance the data set I chose to undersample the yellow taxi data.
2. Visually examining the dendrogram resulting from the hierarchical clustering allowed a good choice of k, the number of clusters. k was also chosen based on the knowledge gained from the exploratory data analysis.
3. Using Gower distance for the dissimilarity matrix allowed the use of mixed-type variables (numerical and categorical).
4. Clustering itself extends the exploratory data analysis by finding patterns in multivariate data. Graphs are limited by the number of dimensions that are human “readable”, but can prove useful in determining which features to focus on in a model.
Limitations
1. Manhattan trips by yellow taxis dominate the data set. A sampling method is needed that weights the data such that these trips do not dominate the sample.
2. In marketing and customer segmentation, hierarchical clustering is used as a first step before predictive (regression) models are built. Regression could have been the next step in this process, to predict likely fares.
3. Because the model is clustering only, I chose not to split the data into training and testing sets. Having a test set is the best way to validate model performance.
4. Lack of physical memory limited the clustering to 5,000 observations. Ideally the entire 200,000-row dataset would have been used.
5. Again due to lack of memory, undersampling was done. However, the dominance of yellow taxi Manhattan data indicates that oversampling the green taxi data, to increase it from approximately 27,000 rows, could have been a better choice.
A look at where green taxis earn more than yellow taxis. These opportunities exist in spite of yellow taxis dominating the market and, in this particular sample dataset, having 87% of it. Green taxis earn slightly more for Nassau/Westchester rates between 6-13 miles, a little more for Newark trips from 13-25 miles, and for Standard rate trips of 25-35 miles.
However, breaking down the data further by borough and payment type gave further insight into where the highest fares were being earned and by whom: yellow taxi (street-hail) customers, paying by credit card ($30.50) or cash ($26), for trips beginning in Queens (non-airport trips). Dispatch green taxi trips to Outside Boroughs also make money at standard rates ($26.25). This confirms what is seen in Plot One.
This reflects the effect of JFK and long distance rates (Newark, Nassau/Westchester) on average fares. We know yellow taxis make their highest earnings from standard rate trips. The effect of JFK trips, however, is much smaller, increasing the average fares by only $5-$6. The graph also confirms what is seen in Plot One: there are opportunities for green taxis to make money. Here the graph gives more detail: dispatched credit card trips starting in Queens, Brooklyn, and Manhattan make $27.50, $33 and $20 respectively.
This proved to be a very interesting dataset. Errors in the dataset can be discovered early if you know exactly which features are of interest and obtain descriptive statistics for them. However, this is not always possible, as you might not know ahead of time which features will prove most useful, and some datasets have hundreds, if not thousands, of features. A good rule of thumb appears to be to use any existing domain knowledge to check the range of values each variable should have. Where this is not possible, out-of-range values will be discovered as part of the EDA process itself, as questions are asked and answered. Sometimes, what initially appear to be errors might turn out to be useful data that forms a critical part of the analysis.
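The rule of thumb about checking value ranges can be sketched in a few lines of R. The limits below are hypothetical placeholders standing in for domain knowledge, and the data frame is a toy example that reuses the extreme fares from the summaries earlier in this report.

```r
# Toy sample containing the extreme fares seen in the earlier summaries
trips <- data.frame(
  fare_amount   = c(10, -87, 6.5, 356, 15.5),
  trip_distance = c(1.7, 0.5, 0, 17.6, 2.1)
)

# Expected (min, max) per variable; these limits are assumptions for illustration
expected <- list(
  fare_amount   = c(0, 300),
  trip_distance = c(0, 100)
)

# Count the values that fall outside the expected range for each variable
out_of_range <- sapply(names(expected), function(v) {
  rng <- expected[[v]]
  sum(trips[[v]] < rng[1] | trips[[v]] > rng[2])
})
print(out_of_range)

# Inspect the flagged rows before deciding whether they are errors or useful data
print(trips[trips$fare_amount < 0 | trips$fare_amount > 300, ])
```

Flagging rather than dropping the rows keeps open the possibility, noted above, that an apparent error such as a negative fare is meaningful.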
Two of the three groups also bypass the JFK and Manhattan restrictions in place for the green taxi sector. To ascertain whether group 3 falls outside of the Upper Manhattan pickup restrictions, we will have to drill down to the community level using the longitude and latitude coordinates provided.
The data actually showed an interesting, unexpected pattern of high average fare trips that are not in fact JFK-based. These are standard rate trips from Queens and are high for both cash and credit card customers serviced by the yellow taxi sector. The green sector can target the segment(s) in which yellow taxis make the most money.
With regard to JFK trips, it might not make sense for green taxis to target this segment, as the average fare is only $5-$6 higher than that earned from standard rate customers. They would also have the additional challenge of overcoming the existing JFK street-hail regulation prohibiting them from targeting these customers.
This dataset is the perfect candidate for additional analysis, such as the following:
Using a larger dataset or sample to confirm this initial analysis. Some sub-sample sizes were rather small, and it would be advantageous to increase the overall starting sample size from 200,000 rows to reinforce or confirm the analysis done in this project. The original dataset is over 200 million rows. A larger sample is also well suited to adding variables based on the weather conditions seen throughout a calendar year and noting their effect on trips and fares.
Another way to verify the analysis is to take a random sample of data approximately equal in size to this one, choosing a different month/year, also at random. Next, conduct a hypothesis test to see whether the findings from the first sample are statistically significant. Of course, variations due to weather could lead to a rejection of the null hypothesis.
Location data is in the form of latitude and longitude. This is the perfect opportunity to drill down from the borough to the neighborhood level to gain even further insight about the origin of these customers and the type of locations or areas where a trip begins.
Additional market research using primary sources (taxi drivers and customers) is also an option for further analysis. More information about customers can be added to the analysis/model.
Building a better cluster model by weighting and/or oversampling any smaller sub-samples. The green taxis will always have a smaller dataset by default because that sector is smaller and newer. Any cluster model should take this into consideration by either oversampling the green data or reducing the dominance of yellow taxis, especially for trips starting in Manhattan, via weights and/or undersampling.
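The re-sampling suggestion in the list above could be sketched as follows. The choice of test is an assumption, not part of the original analysis: a Wilcoxon rank-sum test is used here because the fares are heavily skewed, and the three samples are simulated log-normal fares rather than real trip data.

```r
set.seed(2015)

# Simulated fares: two samples from the same distribution, one shifted upward
sample_a <- rlnorm(500, meanlog = log(10), sdlog = 0.5)  # stand-in for this sample
sample_b <- rlnorm(500, meanlog = log(10), sdlog = 0.5)  # stand-in for another month
sample_c <- rlnorm(500, meanlog = log(20), sdlog = 0.5)  # fares roughly doubled

# Null hypothesis: both samples come from the same fare distribution
same_dist <- wilcox.test(sample_a, sample_b)
shifted   <- wilcox.test(sample_a, sample_c)

print(same_dist$p.value)  # compare against the chosen significance level
print(shifted$p.value)    # a very small value rejects the null hypothesis
```

With real data, a rejection would not by itself say why the samples differ; seasonal effects such as weather would need to be ruled out separately.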
This entire process also reflects the iterative nature of data science. Initially you get a picture of what is happening, but that picture creates more questions. Answering these questions might require more data and techniques outside of EDA (model building). This is where the statistics course (from Udacity) will come in handy; creating surveys or designing experiments to help answer the additional questions will also prove valuable. The results of the surveys and any hypothesis testing can be combined to form the next stage of EDA and gain further insight. In retrospect, this is how it works in the real world: access to business or domain knowledge, hypothesis testing (A/B etc.) and/or sufficient data drives insight in any analytics project.